AI Alignment Evals Hackathon
We want our AIs to have certain values in robust and scalable ways. There are methods proposed to get the values into the models - however it's highly ambiguous how well they actually work. Let's remove that ambiguity.
Join our week-long AI Alignment Evaluations Hackathon, running from January 25 to February 2, 2024, where researchers, developers, and curious minds come together to tackle the challenge of measuring and advancing AI alignment. This is a chance to think critically, experiment with ideas, and contribute to making progress on one of the most important problems of our time.
What are AI alignment evaluations and why do they matter?
AI alignment evaluations are tools designed to measure how well AI systems align the intended values and goals. These tools systematically assess what a model can do (capabilities), and whether that model chooses to behave in ways consistent with the intended values when it can (alignment).
Alignment evaluations often involve testing models across datasets to estimate the likelihood of aligned behavior in controlled settings. They provide measurable insights that help us understand, refine, and improve AI systems, contributing to efforts to make them safer and more aligned with specific intentions.
Some notable examples of alignment evaluations are:
DecodingTrust: Provides a comprehensive trustworthiness evaluation of GPT models.
MACHIAVELLI: A benchmark of 134 Choose-Your-Own-Adventure games testing scenarios that center on social decision-making.
RuLES: Measures the rule-following ability of LLMs.
SALAD-Bench: A safety benchmark designed for evaluating LLMs, defense, and attack methods.
TruthfulQA: Measures whether a language model is truthful in generating answers to questions.
What can you expect from joining the hackathon?
Contribute to the future of AI alignment by gaining hands-on experience in designing your own evals.
What you’ll learn
Participating in this hackathon is an opportunity to test and expand your understanding of alignment evaluations, while building your portfolio. Here’s what you’ll get to learn:
How to design a benchmark, from defining success metrics to setting up test cases and interpreting results.
How to use existing benchmarks and apply them to real-world use cases.
How to fine-tune models and evaluate their impact on alignment outcomes.
How to develop adversarial test cases to identify weaknesses in current benchmarks.
How to train a cross-coder to compare fine-tined models against their base counterparts.
The goal is to design benchmarks that test if a model consistently upholds its intended values (robust) and applies them effectively even in out-of-distribution scenarios (scalable). By the end of the hackathon, you will be equipped with practical tools and methods for evaluating and improving AI alignment.
What we can provide
As a participant, we will provide you with:
10 versions of a model, all sharing the same base but trained with PPO, DPO, IPO, KPO, etc.
Step-by-step guides for creating evals (i.e., what is it, how to run an eval, things to consider when making one, how to make one, etc.).
Tutorials on using HHH, SALAD-Bench, MACHIAVELLI, and more.
An introduction to Inspect, an evaluation framework by the UK AISI.
How does the red team vs blue team challenge work?
As a participant, you can either be part of the red team or a blue team.
The red team's challenge: Make the Trojans
Your goal is to test the limits of existing alignment benchmarks and expose their weaknesses. Using the resources provided:
Find cases where models pass alignment benchmarks but still fail to uphold the intended values or behaviors.
Craft adversarial inputs and edge cases to uncover vulnerabilities or unexpected behaviors.
Highlight specific ways existing benchmarks fall short, such as allowing models to score highly while exhibiting misaligned behavior the benchmark was meant to filter out.
The blue team's challenge: Build better benchmarks
Your goal is to strengthen alignment benchmarks and propose new ones. Using the resources provided:
Adapt existing (or create new) benchmarks to test if the intended values were embedded in the model (e.g., comparing a fine-tuned model to its base using a cross-coder or testing if the model learned an abstraction).
Create new test cases or evaluation metrics that address previously overlooked aspects of alignment.
Ensure benchmarks are applicable across a variety of models and training approaches.
How can you participate?
You don't need to be an evals expert to join the hackathon. All you need is basic Python programming experience and a high-level familiarity with neural network training concepts. You can learn the rest in the hackathon! To participate:
Register your interest via https://lu.ma/xjkxqcya, and indicate your preferred role.
Join the Discord server and attend the events. There will be paper reading sessions every 5pm UTC and office hours at 6pm UTC leading up to the hackathon.
Form a small team. Discuss your ideas with other participants and form your team of 2-5 members. The deadline for finalizing teams is on 24 January 2025.
Get started. Attend the kickoff event on 25 January 2025 to receive resources, guidance, and tips, and then begin working on your project.
Submit your initial work. Let us know what you're working on by 28 January 2025.
Submit your final work. Submit your final project, including documentation and code, by the deadline on 31 January 2025. Submissions should include a GitHub repo of your work and a 2-page writeup summarizing your approach.
Present your work. The top submissions for each track will be invited to present their results during the Presentation Evening on 02 February 2025.
There will be events all throughout the month on the AI-Plans Discord server such as paper reading sessions and office hours to help you form an idea and a team.
Are there any prizes?
We’re giving away USD 150 worth of prizes for this hackathon:
USD 50 for the Red Team with the most "high-score but unaligned" results on safety benchmarks.
USD 50 for the Blue Team that creates the best safety benchmark (as determined by peer review).
USD 50 for the Blue Team whose benchmark catches the most misaligned models/attempts by Red Teams.
Where will the hackathon be held?
The hackathon will take place online on the AI-Plans Discord. If you'd like to host a local gathering onsite for your community, you can let us know by filling out this form: https://tally.so/r/wvENk8.
Want to be a collaborator?
We're looking for experts, organizations, and passionate individuals to help make this hackathon a success! If you'd like to contribute as any of the following, kindly let us know via by selecting the Collaborator ticket on this RSVP page.
Mentor: Offer mentorship, insights, or tools to participants as they work on their project.
Judge: Evaluate and provide feedback on participants' projects during the hackathon.
Sponsor: Support the event with resources or prizes.